DYNAMIC OPTIMIZATION OF COMPUTER PROGRAMS USING CODE-REWRITING KERNEL MODULE

FIELD OF THE INVENTION

The present invention relates to a method and apparatus for the dynamic optimization of computer programs.

BACKGROUND TO THE INVENTION

The source code of a computer program is generally written in a high-level language that is humanly readable, such as FORTRAN or C. The source code is translated by a compiler program into an assembly language. The binary form of the assembly language, called object code, is the form of the computer program that is actually executed by a computer. Object code is generally comprised of assembly language or machine language for a target machine, such as a Hewlett-Packard PA-RISC microprocessor-based computer. The object code is generally first produced in object code modules, which are linked together by a linker. For the purposes of the present invention, the term "compile" refers to the process of both producing the object code modules and linking them together.

Because computer programs are written as source code by humans, it is usually not written in a way to achieve optimal performance when eventually executed as object code by a computer. Computer programs can be optimized in a variety of ways. For example, optimizing compilers perform static optimization before the code is executed. Static optimization can be based on particular rules or assumptions (e.g., assuming that all "branches" within a code are "Taken"), or can be profile-based. To perform profile-based optimization ("PBO"), the code is executed under test conditions, and profile information about the performance of the code is collected. That profile information is fed back to the compiler, which recompiles the source code using the profile information to optimize performance. For example, if certain procedures call each other frequently, the compiler can place them close together in the object code file, resulting in fewer instruction cache misses when the application is executed.

Dynamic optimization refers to the practice of optimizing computer programs as they execute. Dynamic optimization differs from static optimization in that it occurs during runtime, not during compilation before runtime. Generally, dynamic optimization is accomplished as follows. While a computer program is executing, a separate dynamic optimization program observes the executing computer program and collects profile data. The dynamic optimization program can be implemented as a dynamically loadable library (DLL), as a subprogram inserted into the computer program by a compiler before runtime, or by a variety of other means known in the art.

The profile data can be collected by "instrumenting" the object code. Instrumentation of code refers to the process of adding code that generates specific information to a log during execution. The dynamic optimization program uses that log to collect profile data. Instrumentation allows collection of the minimum specific data required to perform a particular analysis. General purpose trace tools can also be used as an alternative method for collecting data. Instrumentation can be performed by a compiler during translation of source code to object code. Those skilled in the art will recognize that the object code can also be directly instrumented by a dynamic translator performing an object code to object code translation, as explained more fully in U.S. patent application No. 5,815,720, issued to William B. Buzbee on Sep. 29, 1998, and incorporated by reference herein.

Once the code is instrumented and executed, the dynamic optimization program collects the profile data generated by the instrumentation. The dynamic optimization program analyzes the profile data, looking, for example, for "hot" instruction paths (series of consecutive instructions that are executed often during execution). The dynamic optimization program then optimizes portions of the computer program based on the profile data.

Dynamic optimization is generally accomplished without recompilation of the source code. Rather, the dynamic optimization program rewrites a portion of the object code in optimal form and stores that optimized translation into a code cache. A hot instruction path, for example, might be optimized by moving that series of instructions into sequential cache memory locations. Once the optimized translations are written into the code cache, the dynamic optimization program switches execution flow of control to the optimized translations in the code cache when any of the optimized instructions are thereafter called.

Prior art dynamic optimizers have several disadvantages. First, because the dynamic optimization program is located in user memory space, it has limited privileges. Computer memory is allocated by the computer's operating system into several categories. One demarcation is between user memory space and kernel memory space. Kernel memory space is generally reserved for the computer operating system kernel and associated programs. Programs residing in kernel memory space have unrestricted privileges, including the ability to write and overwrite in user memory space. By contrast, programs residing in user space have limited privileges, which causes significant problems when performing dynamic optimization in the user space.

Modern computers permit a computer program to share its program text with other concurrently executing instances of the same program to better utilize machine resources. For example, three people running the same word processing program from a single computer system (whether it be a client server, multiple-CPU computer, etc.) will share the same computer program text in memory. The program text, therefore, sits in "shared user memory space," which is accessible by any process running on the computer system.

As explained above, however, in order to perform dynamic optimization, the dynamic optimization program must be able to alter the program text to direct flow of control to the optimized translations in the code cache. Because the dynamic optimization program sits in user memory space, it does not have the privilege to write into the program text. Accordingly, as illustrated in FIG. 1, the program text must be emulated so that the dynamic optimization can take place in private memory space dedicated to the particular process being executed.

Referring to FIG. 1, computer program text 10 (object code after compilation), is emulated by a software emulator 20 that is included within a dynamic optimization program 30. "Emulation" refers to the "software execution" of the computer program text 10. The computer program text 10 is never run natively. Rather, the emulator 20 reads the computer program text 10 as data. The dynamic optimization program 30 (including the emulator 20) ordinarily takes the form of a shared library that attaches to different processes running the computer program text 10. In the example shown in FIG. 1, three different instances of the same computer program 10 are being run via processes A, B, & C.
Focusing on Process A, for example, the dynamic optimization program 30 collects profile data during emulation of instructions for Process A. Based on that profile data, the dynamic optimization program 30 creates optimized translations of portions of the instructions called by Process A and inserts them into an optimized translation code cache 40 stored in private memory space allocated for Process A. Thereafter, when a previously optimized instruction is called by Process A, flow of control is forwarded to the optimized translation for that instruction in the code cache 40. Optimization of instructions for each of the other processes (B and C) works in the same manner.

There are several problems with this approach. First, running a computer program 10 through an emulator 20 is on the order of fifty times slower than running the computer program 10 natively. Second, because the optimized translations are stored in private user memory space specific to particular processes, they are not shared across the computer system, thereby causing significant duplication of work. Individual processes do not have access to other processes' optimized translations. Finally, because the dynamic optimization program 30 does not have the privilege of writing into kernel space, there is no way to optimize the computer operating system kernel.

What is needed is a method and apparatus to dynamically optimize computer programs that permits optimized translations to be shared across a computer system.

What is needed is a method and apparatus to dynamically optimize computer programs without degrading the performance of the programs.

What is needed is a method and apparatus to dynamically optimize computer programs that permits optimization of the computer operating system kernel.

SUMMARY OF THE INVENTION

The present invention solves the problems of the prior art by using a kernel module to perform dynamic optimizations both of user programs and of the computer operating system kernel, itself. The kernel module permits optimized translations to be shared across a computer system without emulation because the kernel module has the privileges necessary to write into the computer program text in shared user memory space. In addition, the kernel module can be used to optimize the kernel itself because it, too, is located in the kernel memory space.

The method of the present invention generally comprises loading a computer program to be optimized into shared user memory space; executing the computer program; analyzing the computer program as it executes, including the substep of collecting profile data about the computer program; providing profile data to a kernel module located in kernel memory space; generating at least one optimized translation of at least one portion of the computer program using the kernel module; and patching the computer program in shared user memory space using the at least one optimized translation as the computer program continues to execute.

The apparatus of the present invention generally comprises at least one processor, adapted to execute computer programs, including computer programs that are part of a computer operating system program kernel; and a memory, operatively connected to the processor, adapted to store in kernel memory space a computer operating system program kernel and a code-rewriting kernel module, wherein the code-rewriting kernel module is adapted to receive profile information regarding a computer program while it is executing and to optimize at least a portion of that executing computer program.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a dynamic optimization system of the prior art.

FIG. 2 is a block diagram of a computer system according to the present invention.

FIG. 3 is a block diagram illustrating a preferred embodiment of the present invention.

FIG. 4 is a flow chart illustrating a preferred method of the present invention for dynamically optimizing computer programs in user memory space.

FIG. 5 is a flow chart illustrating in greater detail the step of generating optimized translations.

FIG. 6 is a block diagram illustrating a kernel module according to a preferred embodiment of the present invention.

FIG. 7 is a flow chart illustrating a preferred method of the present invention for dynamically optimizing a computer operating system kernel.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 2 is a block diagram of a computer system 50 that is used to implement the methods and apparatus embodying the present invention. The computer system 50 includes as its basic elements: several CPUs 60, a memory 70, and an I/O controller 80. The CPUs 60 preferably include performance monitoring units (PMUs 90), which are able to nonintrusively collect profile information about instructions executed by the CPUs 60. The memory 70 includes within it, among other things, allocated kernel memory space 100 and user memory space 110. In addition, the user memory space 110 is allocated into shared user memory space 120 and private process user memory space 130. The CPUs 60, memory 70, and I/O controller 80 are all connected via a bus structure. The I/O controller 80 controls access to and information from external devices such as keyboards 140, monitors 150, permanent storage 160, and a removable media unit 170. In addition, the computer system may be connected through a network connection 180 to other computer systems.

It should be understood that FIG. 2 is a block diagram illustrating the basic elements of a computer system. This figure is not intended to illustrate a specific architecture for the computer system 50 of the present invention. For example, no particular bus structure is shown because various bus structures known in the field of computer design may be used to interconnect the elements of the computer system 50 in a number of ways, as desired. Each of the CPUs 60 may be comprised of a discrete arithmetic logic unit (ALU), registers, and control unit or may be a single device in which these parts of the CPU are integrated together, such as in a microprocessor. Moreover, the number and arrangement of the elements of the computer system 50 may be varied from what is shown and described in ways known in the art (i.e., client server systems, computer networks, etc.). The operation of the computer system 50 depicted in FIG. 2 is described in greater detail in relation to FIGS. 3 through 7.

FIG. 3 is a block diagram of a preferred embodiment of the present invention. In this embodiment, a source code 190 is fed to a compiler 200 where it is compiled into an object code representing the computer program text 210. The object code 210 is loaded into shared user memory space, by the computer operating system kernel 220 or a dynamic loader. If the computer program text 210 is enabled for dynamic optimization, the dynamic loader will also load into
shared user space a "dynamic optimization helper" 230. The
dynamic optimization helper 230 can take the form of a
shared library. In addition, the dynamic loader (or computer
operating system kernel 220) can enable every program 210
for dynamic optimization as a default, or that decision can
be made based on characteristics of the computer program
210 or by the user himself.

As an alternative, the dynamic optimization helper 230 
can take the form of a "daemon" process spawned by the 
computer operating system kernel 220. A daemon is a 
continually running computer process. Here, the daemon can 
have multiple threads to accommodate the dynamic optimi- 
zation of several processes running the computer program 
text 210. 

As described more specifically with relation to FIG. 4, the
dynamic optimization helper 230 in shared user space
(whether a shared library or a daemon process) is respon-
sible for collecting profile data, preprocessing the profile
data, and sending a streamlined form of profile data to a
code-rewriting kernel module 240 located in kernel memory
space. The code-rewriting kernel module 240 analyzes the
profile data and generates optimized translations of portions
of the computer program 210 from the profile data. The
kernel module 240 then writes the optimized translations
into a code cache 250 in shared user memory space. The
kernel module 240 also inserts jump instructions into the
computer program text 210 to switch execution flow of
control to the optimized translations whenever an optimized
instruction within the computer program text 210 is called
by a process. Because the computer program text 210 does
not need to be emulated and the optimized instruction code
cache 250 is in shared user memory space, all processes
supported by the computer system 50 have the benefit of the
optimized translations.
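
The division of labor described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the patented implementation: memory is modeled as Python dictionaries, "optimization" is reduced to copying the hot instructions into contiguous cache slots, and the names `SharedUserMemory`, `CodeRewritingKernelModule`, `optimize_trace`, and `fetch` are hypothetical.

```python
class SharedUserMemory:
    """Toy stand-in for shared user memory space (hypothetical model)."""

    def __init__(self, program_text):
        self.program_text = dict(program_text)  # address -> instruction
        self.code_cache = {}                    # cache address -> instructions


class CodeRewritingKernelModule:
    """Toy stand-in for the kernel module, the only writer of program text."""

    def __init__(self, shared, cache_base):
        self.shared = shared
        self.next_free = cache_base

    def optimize_trace(self, trace_addresses):
        text = self.shared.program_text
        # "Optimizing" is reduced to laying the hot instructions out contiguously.
        translation = [text[addr] for addr in trace_addresses]
        cache_addr = self.next_free
        self.shared.code_cache[cache_addr] = translation
        self.next_free += len(translation)
        # Patch the head of the trace with a jump into the code cache.
        text[trace_addresses[0]] = ("jump", cache_addr)
        return cache_addr


def fetch(shared, address):
    """Return the instructions any process would run on reaching `address`."""
    instr = shared.program_text[address]
    if isinstance(instr, tuple) and instr[0] == "jump":
        return shared.code_cache[instr[1]]
    return [instr]


shared = SharedUserMemory({0x100: "sub", 0x110: "ld8", 0x120: "add"})
module = CodeRewritingKernelModule(shared, cache_base=0xFF0)
cache_addr = module.optimize_trace([0x110, 0x120])
```

Because every process reads the same `shared` object, a single patch is visible to all of them, which is the property the shared code cache 250 is meant to provide.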

In addition, the embodiment illustrated by FIG. 3 permits
the optimization of the computer operating system kernel
220, which is stored in kernel memory space along with the
code-rewriting kernel module 240. Preferably, when the
computer operating system kernel 220 is first initialized, a
dynamic optimization helper 260, similar to the dynamic
optimization helper 230 in shared user space, is attached to
the computer operating system kernel 220 and loaded into
kernel memory space. The dynamic optimization helper 260
in kernel memory space operates similarly to the dynamic
optimization helper 230 in shared user memory space. It
collects profile data regarding the executing computer oper-
ating system kernel 220, processes that profile data, and
provides it to the code-rewriting kernel module 240. The
kernel module 240 analyzes the profile data and generates
optimized translations of portions of the computer operating
system kernel 220 from the profile data. The code-rewriting
kernel module 240 then writes the optimized translations
into a code cache 270 in the kernel memory space. The
kernel module 240 also inserts jump instructions into the
computer operating system kernel 220 to switch execution
flow of control to the optimized translations whenever an
optimized instruction within the computer operating system
kernel 220 is called.

FIG. 4 illustrates the basic method of the present inven-
tion for optimizing shared user computer programs. A source
code 190 is first compiled 290. Compilation 290 results in
computer program text 210 in the form of object code. The
computer program text 210 is then loaded 310 into shared
user memory space. If a shared-library dynamic optimiza-
tion helper 230 is employed, it is also loaded into shared user
memory space. As discussed, a daemon-process dynamic
optimization program can alternatively be employed. Unless
otherwise noted, for the remaining discussion of FIGS. 4-7,
it is assumed that a shared-library dynamic optimization
helper 230 is employed. The computer program text 210 is
then executed 320 by one or more of the CPUs 60. During
execution, the dynamic optimization helper 230 analyzes
330 the executing instructions and collects profile data. As
part of this analysis, the dynamic optimization helper 230
searches for optimization opportunities. For example, the
dynamic optimization helper 230 searches for "hot" instruc-
tion paths, as previously described. Typically, the dynamic
optimization helper 230 will not grow traces that pass from
the computer program text 210 through shared libraries.
Otherwise, a user might unload a particular shared library
and load a new one during execution, which would cause
errors to occur if part of an optimized patch depended on the
old shared library. It is preferred, therefore, that the helper
230 not grow traces through a shared library unless it can
verify that the user does not have the capability to unload
that shared library. Depending on the computer operating
system employed, that determination is made possible by
examining API system calls.
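
The trace-growing rule in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the helper's actual algorithm: each basic block is represented as a hypothetical `(module_name, hottest_successor)` pair, and `pinned_libraries` stands for the set of shared libraries that the helper has verified cannot be unloaded.

```python
def grow_trace(blocks, start, main_module, pinned_libraries=frozenset(), max_len=16):
    """Follow hottest successors from `start`, stopping at unpinned libraries.

    `blocks` maps a basic-block address to (module_name, next_address_or_None).
    """
    trace = []
    addr = start
    while addr is not None and len(trace) < max_len and addr not in trace:
        module, successor = blocks[addr]
        # Do not grow the trace into a shared library that might be unloaded.
        if module != main_module and module not in pinned_libraries:
            break
        trace.append(addr)
        addr = successor
    return trace


blocks = {
    0x100: ("app", 0x110),
    0x110: ("app", 0x900),
    0x900: ("libfoo", 0x120),
    0x120: ("app", None),
}
conservative = grow_trace(blocks, 0x100, "app")
permissive = grow_trace(blocks, 0x100, "app", pinned_libraries={"libfoo"})
```

In the conservative call the trace ends at the library boundary; in the permissive call, where `libfoo` is assumed to be verified as non-unloadable, the trace continues through it.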

Profile data is preferably collected via the PMUs 90 
included in the CPUs 60. Although the exact operation of 
PMUs 90 varies with different processors, generally they 
operate as follows. A PMU 90 includes multiple counters 
programmable to count events like: clock cycles; instruc- 
tions retired; and number of stalls caused by events such as 
data cache/TLB misses, instruction cache/TLB misses, pipe-
line stalls, etc. Sophisticated PMUs 90 include trace buffers
that record the history of branches and branch prediction, 
and addresses that caused cache/TLB misses, etc. PMU 
counters can be programmed to trigger an interrupt after a 
certain number of events, at which time the PMU counter 
values can be read. This method of reading PMU counters is 
known as sampling. The dynamic optimization helper 230 
samples the PMU 90 history periodically to gather profile 
data and save it to a log. If the particular PMU 90 employed 
does not permit the dynamic optimization helper 230 to read 
or reprogram it from user memory space, the dynamic
optimization helper 230 makes a system call to request that 
the computer operating system kernel 220 perform those 
functions. Using PMUs 90 to collect profile data is preferred 
over the method of instrumenting the object code during 
compilation because PMU 90 collection of profile data is 
nonintrusive and does not significantly degrade processing 
speed like instrumentation often does. 
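
Sampling, as described above, can be illustrated with a simulated counter. The sketch below does not use any real PMU interface; `SimulatedPMUCounter` and `sample_profile` are hypothetical names, and the "interrupt" is modeled as a plain callback that fires each time the counter reaches its programmed period.

```python
from collections import Counter


class SimulatedPMUCounter:
    """Counts events and calls `on_overflow` every `period` events."""

    def __init__(self, period, on_overflow):
        self.period = period
        self.count = 0
        self.on_overflow = on_overflow

    def record_event(self, address):
        self.count += 1
        if self.count == self.period:
            self.count = 0                 # re-arm the counter
            self.on_overflow(address)      # stands in for the interrupt


def sample_profile(executed_addresses, period):
    """Return a log mapping instruction address -> number of samples taken there."""
    log = Counter()
    pmu = SimulatedPMUCounter(period, lambda addr: log.update([addr]))
    for addr in executed_addresses:
        pmu.record_event(addr)
    return log


# A loop that spends three quarters of its instructions at 0x110.
stream = [0x100, 0x110, 0x110, 0x110] * 25
profile = sample_profile(stream, period=3)
```

Only one event in three is recorded, yet the hot address still dominates the log, which is why sampling can be far less intrusive than instrumenting every instruction. The period should be chosen with care: a period that divides evenly into a loop's length can sample the same instruction every time.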

Preferably, the dynamic optimization helper 230 does not 
provide 340 all of the profile data to the kernel module 240. 
Rather, it is preferred that the dynamic optimization helper 
230 streamline the profile data sent to the kernel module 240 
because the computer operating system kernel 220 will 
allocate only a certain amount of processing time to the 
kernel module 240 to perform dynamic optimizations. 
Accordingly, the dynamic optimization helper 230 prefer- 
ably provides only selected profile data, such as "annotated 
traces" of hot instruction paths. A "trace" is a series of basic
blocks of code identified by the starting address of each 
basic block. "Annotations" to those traces may include 
summary profile data relating to particular traces. The sum- 
mary profile data is preferably generated by the dynamic 
optimization helper 230, which preprocesses the raw profile 
data from the PMUs 90 in the manner known in the art. 
Examples of annotations include: probability that branches 
between basic blocks are taken or not taken; number and 
addresses of cache misses; amount of processing time spent 
in a trace, etc. 
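
One possible in-memory shape for an annotated trace is sketched below. The field names and the `select_hot_traces` helper are assumptions made for illustration; the text specifies only that a trace is a list of basic-block starting addresses and that annotations carry summary profile data such as branch probabilities, cache misses, and time spent in the trace.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedTrace:
    # Starting address of each basic block, in execution order.
    block_addresses: list
    # (from_block, to_block) -> probability that the branch is taken.
    branch_taken_prob: dict = field(default_factory=dict)
    # Addresses whose accesses caused cache misses.
    cache_miss_addresses: list = field(default_factory=list)
    # Processing time attributed to the trace, in cycles.
    cycles_in_trace: int = 0


def select_hot_traces(traces, budget):
    """Keep only the `budget` traces with the most time spent in them."""
    ranked = sorted(traces, key=lambda t: t.cycles_in_trace, reverse=True)
    return ranked[:budget]


traces = [
    AnnotatedTrace([0x100, 0x140], cycles_in_trace=10),
    AnnotatedTrace([0x200, 0x220], {(0x200, 0x220): 0.9}, [0x8000], 50),
    AnnotatedTrace([0x300], cycles_in_trace=30),
]
streamlined = select_hot_traces(traces, budget=2)
```

Ranking by time spent is only one plausible way to streamline the data; the helper could equally filter on miss counts or branch bias before making its system call.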

Once the streamlined profile data is provided 340 to the 
kernel module 240, the kernel module 240 uses that profile 
data to generate 350 optimized translations of certain por-
tions of the computer program text 210. FIG. 5 illustrates in
greater detail the step of generating 350 optimized transla-
tions. The kernel module 240 examines 360 the profile data
and selects a portion of the computer program text 210 to 
optimize. For example, the kernel module 240 might select a section of computer program text 210 that includes all of the instructions identified as a hot instruction path by the dynamic optimization helper 230. The kernel module 240 copies 370 the selected computer program text 210 from shared user memory space into the kernel memory space. The computer program text 210 is then decoded 380 from object code into an intermediate code so that it can be optimized 390. Once optimized 390, the intermediate code is encoded 400 back into object code. Those of ordinary skill in the art will recognize that the decoding 380, optimization 390, and encoding 400 of the selected computer program text 210 can be accomplished in a variety of ways, and the present invention is not limited to any particular method for accomplishing those processes.

Referring back to FIG. 4, once an optimized translation is generated by the kernel module, it is used to patch 410 the computer program text 210. As used herein, patching 410 refers broadly to any method of utilizing optimized translations. One method for doing so is further illustrated in FIG. 4. The kernel module writes 420 the optimized translation into a code cache 250 allocated at the end of the same memory space where the computer program text 210 is stored. If no such code cache 250 exists, the kernel module 240 allocates one. In addition, the kernel module 240 inserts 430 jump instructions into the computer program text 210 directing flow of execution control to the optimized translation whenever one of the instructions included within the optimized translation is called. The kernel module 240 is permitted to write into the computer program text 210 because, as a trusted kernel module 240, it has unrestricted privileges to write into user space. In addition, as discussed, because the optimized translations are stored in shared user memory space and the present invention does not require an emulator, different processes running the same computer program text 210 on the same computer system 50 have the benefit of the optimized translations.

The following is a general example of how a jump instruction is inserted 430. Assume that the computer program text 210 reads as follows:

    0x100   nop   ld4   sub
    0x110   ld8   add   br
    0x120   sub   br    call

If a "hot trace" is identified as beginning at 0x110, the present invention replaces the bundle at 0x110 with a branch jumping to the optimized trace in the code cache 250:

    0x100   nop   ld4   sub
    0x110   nop   nop   br 0xff0   -> jump to the code cache
    0x120   sub   br    call

The optimized trace in the code cache 250 then appears as follows:

[...]

[...] code cache 250 point back to the computer program text 210.

When inserting jump instructions into the computer program text 210, it is important that the kernel module 240 ensure atomicity. Processors ensure a certain size of "atomic unit," that is, the maximum size of instruction or data that the processor guarantees will not be interrupted during writing or execution. For most processors, the atomic unit is four bytes (one "word"). [...] a jump instruction according to the present invention will exceed the size of the atomic unit guaranteed by the applicable CPU 60. In that instance, the following procedure may be used to ensure atomicity.

After the optimized translation is written into the code cache, a one-word "cookie" is placed in the computer program text 210 where the jump instruction is to be inserted. If executed, the cookie directs flow of control to an illegal instruction handler installed by the dynamic optimization helper 230. Then the jump instruction is written into the computer program text 210, and the cookie is thereafter deleted. In this manner, if one of the instructions being optimized is called by another process while the jump instruction is being written, there is no danger that a partially written jump instruction will be executed. Rather, flow of control will be directed by the cookie to the kernel's illegal instruction handler, which will send control back to the computer program text 210, by which time the jump instruction is completely inserted 430.

Alternatively, the kernel module 240 can guarantee atomicity as follows. Each page of computer program text 210 has a list of permissions (Read, Write, Execute). Before any instruction is executed, the processor checks permissions (Read and Execute) and access rights (whether the user has access to that page). To guarantee atomicity, the kernel module makes the page of computer program text 210 that will contain the jump instruction(s) nonexecutable during patching 410. After the jump instruction is completely inserted 430 into the computer program text 210, the kernel module 240 resets the Execute permission bit. Thus, if a process attempts to execute the page while the jump instruction is being written, the process is "put to sleep." The kernel module 240 then awakens the process once the jump instruction is inserted 430 and turns the Execute bit back on.

When inserting 430 jump instructions into the computer program text 210, the "reachability" of the jump instruction must also be taken into account. The jump instruction must be a direct branch that specifies the jump target as an offset from the current program counter address. The reachability of the jump instruction is based on the number of bits allocated in the instruction to specify the offset. Modern processors are increasing the number of bits for the offset to increase the reachability, and certain processors provide complete reachability within a 64-bit address space. The present invention, however, is not restricted by reachability of jump instructions. While reachable jump instructions make the implementation of the present invention easier, it is still implementable when reachable jump instructions are not available. If the processor does not have a jump instruction that can reach from the patch location to the translation in the code cache 250, the jump needs to be effected in multiple hops. The hops (except the last one) are to a
reachable jumpstation locations. A jumpstation can be free space that is serendipitously available, or free space left in the program text during compilation 290. In the latter case, the program needs to be compiled with a special flag and results in larger program text than before.

After the kernel module 240 patches 410 the computer program text 210, it preferably generates 350 another optimized translation in the same manner. The kernel module 240 examines 360 additional profile data previously provided by the dynamic optimization helper 230 and selects a different portion of computer program text 210 for optimization. That newly selected portion of the computer program text 210 is then copied 370, decoded 380, optimized 390, encoded 400, and used to patch 410 the computer program text 210 in the manner previously described. This loop is repeated until the kernel module 240 has utilized all of the profile data provided by the dynamic optimization helper 230 or the time allotted by the computer operating system kernel 220 for dynamic optimization runs out.

FIG. 6 provides more detail regarding the structure of the kernel module 240. The kernel module 240 can be implemented as a device driver or a dynamically loadable kernel module (DLKM) and preferably includes a memory buffer 440, a translation pipe 450, and a resource manager 460. Profile data sent by the dynamic optimization helper 230 for particular processes (e.g., A, B, & C) are stored in the FIFO memory buffer 440. The buffer 440 and all of the other functions of the kernel module 240 are managed by the resource manager 460. The resource manager 460 allocates space in the buffer 440 for profile data from different processes. The resource manager 460 also allocates time for the translation pipe 450 to perform optimized translations. For example, the resource manager 460 allocates a preset time for optimization using profile data of a particular process. When that time expires, the resource manager 460 [...]

[...] For example, if the kernel module 240 has performed a high concentration of its previous optimizations in one area of the computer program text 210, the dynamic optimization helper 230 may begin sending the kernel module 240 profile information in higher concentrations from other parts of the computer program text 210. One way to accomplish this is to program the dynamic optimization helper 230 not to form traces of hot instruction paths that include basic blocks residing in the code cache 250. Those of ordinary skill in the art will recognize that the particular decisions made by the dynamic optimization helper 230 to send or not send particular profile data to the kernel module 240 will be based on a variety of factors, and the present invention is not limited to any particular method in that regard. When the dynamic optimization helper 230 has gathered sufficient additional profile information, it makes another system call and provides 340 that profile data to the kernel module 240.

Importantly, the method and apparatus of the present invention can also be used to optimize shared libraries called by a computer program 210. A code cache for each such shared library is allocated at the end of the shared user memory space occupied by the library. In addition, dynamic optimization can be disabled for a particular computer program 210 or for particular subprograms. In the latter case, a compiler can be used to annotate the computer program text 210 to specify regions of the computer program text 210 from which the dynamic optimization helper 230 should not select traces for optimization.

FIG. 7 illustrates the basic method of the present invention for dynamically optimizing the computer operating system kernel 220. This embodiment works essentially the same way as the embodiment illustrated in FIG. 4, except that the computer program being optimized is the computer operating system kernel 220, and all of the components of the system are in kernel memory space. First, the computer

completes any pending translations, and then clears that operating system kernel instructions are executed 470, 

block of profile data from the buffer 440, In addition, if the which will generally be true any time that the computer 

buffer 440 is full and a dynamic helper 230 makes a system system 50 is operating. While the computer operating sys- 

call for optimization, the resource manager 460 sends back tern kernel 220 is executing, a kernel dynamic optimization 

an error message. Also, the resource manager 460 ceases 40 helper 260, also located in kernel memory space, analyzes 

optimization if a process makes a higher priority system call 480 the executing instructions and collects profile data 

(e.g., an intermpt during the optimization). regarding the computer operating system kernel 220. 

The translation pipe 450 reads the profile data and makes In the manner previously described in relation to FIG. 4, 

decisions as to which part of the computer program text 210 the profile data is compacted and provided 490 to the kernel 

to optimize. Optionally, that decision is made by the 45 module 240. Again, the kernel module 240 generates 500 

dynamic optimization helper 230 and the kernel module 240 optimized translations based on the profile data, preferably 

simply optimizes all traces sent to it (to the extent possible using the same steps as described in relation to FIG. 5. Once 

in the time allotted by the resource manager 460). Whatever the kernel module 240 generates 500 the optimized 

traces (or hot instruction paths) to be optimized are then sent translations, they are again used to patch 510 the computer 

to the translation pipe 450, which performs all of the 50 operating system kernel 220. This can be accomplished 

operations shown and discussed in relation to FIG. 5. similarly to the manner described in relation to FIG. 4. The 

Preferably, the dynamic optimization helper 230 provides optimized translations are written 520 into a code cache 270 

a significant number of traces (or profile data) lo the kernel allocated at the end of the kemel memory space where the 

module 240 so that dynamic optimization according to the computer operating system kerne! 220 is located, and the 

present invention can continue for the entire time allotted by 55 kernel module 240 inserts 530 appropriate jump instructions 

the resource manager 460 for such optimization. Once the in the computer operating system kernel 220 to direct 

kernel module 240 is finished generating optimized trans- execution flow of control to the optimized translations, 

lations from a particular set of profile data (either because of Again, this process is preferably repeated such that the 

a time-out or because it has utilized all of the profile data), computer operating system kernel 220 is continually opti- 

the kernel module 240 preferably reports back to the 60 mized as it executes. 

dynamic optimization helper 230 as to the number and/or When optimizing the computer operating system kemel 

nature of the optimized translations accomplished by the 220, it is preferred that some parts of the kernel 220 not be 

kernel module 240, The dynamic optimization helper 230 subject to optimization. This can be accomplished by setting 

then preferably uses that reported information to change the nonoptimizable zones (or memory addresses) within kernel 

profile data that it collects through the FMUs 90 and to make 65 space, llie kemel dynamic optimization helper 260 is then 

decisions about what profile data to analyze 330 and provide programmed to ignore profile data regarding instmctions at 

340 to the kemel module 240 subsequently. these addresses. It is preferred that interrupt handlers, the 
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dynamic optimization helper 260, and the code-rewriting 
kernel module 240 not be subject to optimization. 
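
By way of illustration only, the exclusion of nonoptimizable zones
might be sketched as a filter over profile samples. The sketch
below is written in Python for readability and is not taken from
the disclosure: the zone addresses, the (address, count) sample
format, and the names NONOPTIMIZABLE_ZONES, in_nonoptimizable_zone,
and filter_profile are all assumptions made for this example.

```python
from bisect import bisect_right

# Hypothetical nonoptimizable zones as half-open [start, end) address
# ranges, sorted and non-overlapping. The addresses are illustrative only.
NONOPTIMIZABLE_ZONES = [
    (0x1000, 0x2000),  # e.g. interrupt handlers
    (0x8000, 0x9000),  # e.g. the kernel dynamic optimization helper
    (0xA000, 0xC000),  # e.g. the code-rewriting kernel module
]
_ZONE_STARTS = [start for start, _ in NONOPTIMIZABLE_ZONES]


def in_nonoptimizable_zone(address):
    """Return True if the address falls inside any excluded zone."""
    # Find the last zone whose start is <= address, then check its end.
    i = bisect_right(_ZONE_STARTS, address) - 1
    return i >= 0 and address < NONOPTIMIZABLE_ZONES[i][1]


def filter_profile(samples):
    """Drop samples whose instruction address lies in an excluded zone.

    samples is an iterable of (address, execution_count) pairs
    (an assumed format, not one defined by the disclosure).
    """
    return [(addr, count) for addr, count in samples
            if not in_nonoptimizable_zone(addr)]


samples = [(0x0F00, 12), (0x1800, 900), (0x2000, 40), (0xB000, 7)]
kept = filter_profile(samples)
```

In this sketch the samples at 0x1800 and 0xB000 are discarded
because they fall inside excluded ranges, while the sample at
0x2000 is kept because each range is treated as half-open.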

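The behavior of the resource manager 460 described in relation to
FIG. 6 (a FIFO buffer of profile data, an error returned when the
buffer is full, and a preset time allotment per process) can
likewise be modeled in a few lines. This is a simplified sketch
under stated assumptions, not the disclosed implementation: time
is counted in abstract ticks rather than measured, preemption by
higher priority system calls is omitted, and the names
ResourceManager, BufferFullError, submit, and run_next are
hypothetical.

```python
from collections import deque


class BufferFullError(Exception):
    """Raised when the profile buffer cannot accept another block."""


class ResourceManager:
    def __init__(self, capacity, time_budget):
        self.capacity = capacity        # max number of profile blocks held
        self.time_budget = time_budget  # preset budget per process, in ticks
        self.buffer = deque()           # FIFO of (process_id, traces)

    def submit(self, process_id, traces):
        """Model the helper's system call; refuse it if the buffer is full."""
        if len(self.buffer) >= self.capacity:
            raise BufferFullError("profile buffer full")
        self.buffer.append((process_id, list(traces)))

    def run_next(self, translate, cost=1):
        """Translate traces of the oldest block until done or out of budget.

        Each call to translate runs to completion, so a translation that
        has started is always finished. The block is removed from the
        buffer whether or not every trace was processed.
        """
        process_id, traces = self.buffer.popleft()
        remaining = self.time_budget
        done = []
        for trace in traces:
            if remaining < cost:
                break  # budget expired; remaining traces are dropped
            done.append(translate(trace))
            remaining -= cost
        return process_id, done


rm = ResourceManager(capacity=2, time_budget=2)
rm.submit("A", ["t1", "t2", "t3"])
rm.submit("B", ["u1"])
try:
    rm.submit("C", ["v1"])
    rejected = False
except BufferFullError:
    rejected = True

result_a = rm.run_next(str.upper)  # stand-in for the translation pipe
result_b = rm.run_next(str.upper)
```

Here process A exhausts its budget after two traces, process B
finishes within budget, and the third submission is refused
because the buffer already holds two blocks.
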
The present invention has been described in relation to
preferred embodiments. Those of ordinary skill in the art
will recognize that modifications to the methods and
apparatus described herein can be made without departing from
the scope of the invention. Accordingly, the present
invention should not be limited except by the following claims.

We claim: 

1. A method for dynamically optimizing computer pro- 
grams using a code-rewriting kernel module, comprising the 
steps of: 

loading a computer program to be optimized into shared 
user-memory space; 

executing the computer program; 

analyzing the computer program as it executes, including 
the substep of collecting profile data about the com- 
puter program; 

providing selected portions of the collected profile data to 
a kernel module located in kernel memory space; 

generating at least one optimized translation of at least 
one portion of the computer program using the kernel 
module, the step of generating including using the 
kernel module to perform the substeps of: 
examining the profile data to select at least one portion
of the computer program for optimization; and

copying the at least one portion of the computer
program into kernel memory space; and

patching the computer program into shared user memory 
space using the at least one optimized translation as the 
computer program continues to execute. 

2. The method of claim 1, wherein the step of patching 
includes the following substeps: 

writing the at least one optimized translation into a code 
cache; and 

inserting at least one jump instruction into the computer 
program to direct flow of control to the at least one 
optimized translation. 

3. The method of claim 2, wherein the at least one 
optimized translation in the code cache is accessible by any 
computer process having access to the shared user memory 
space. 

4. The method of claim 1, wherein the step of generating 
includes using the kernel module to perform the following 
substeps: 

decoding the at least one portion of the computer program 
from an object code to an intermediate code; 

optimizing the at least one portion of the computer 
program based on the profile data; and 

encoding the optimized at least one portion of the
computer program from intermediate code to object code.

5. The method of claim 1, wherein the step of analyzing 
includes the further substep of reducing the profile data to 
one or more traces and the step of providing comprises 
providing the one or more traces to the kernel module 
located in the kernel memory space. 

6. The method of claim 1, wherein the step of analyzing 
is performed by a dynamic optimization helper located in the 
shared user memory space. 

7. The method of claim 1, wherein the computer program 
is a shared library.

8. A method for dynamically optimizing a computer 
operating system kernel using a kernel module located in 
memory space, comprising the steps of: 

executing a computer operating system program kernel 
located in kernel memory space of a computer; 
analyzing the computer operating system kernel as it
executes, including the substep of collecting profile
data about the computer operating system program
kernel;

providing profile data to a kernel module located in the
kernel memory space;

generating at least one optimized translation of at least
one portion of the computer operating system program
kernel using the kernel module;

patching the computer operating system program using
the at least one optimized translation as the computer
operating system program kernel continues to execute;
and

loading a dynamic optimization helper into kernel 
memory space, wherein the steps of analyzing and 
providing are performed by the dynamic optimization 
helper. 

9. The method of claim 8, wherein the step of patching 
includes the following substeps: 

writing the at least one optimized translation into a code 
cache located in the kernel memory space; 

inserting at least one jump instruction into the computer 
operating system program kernel to direct flow of 
control to the at least one optimized translation. 

10. A computer system for dynamically optimizing
computer programs using a code-rewriting module, comprising:

at least one processor, adapted to execute computer 
programs, including computer programs that are part of 
a computer operating system kernel; 

a memory, operatively connected to the processor, 
adapted to store in kernel memory space a computer 
operating system program kernel and a code-rewriting 
kernel module, wherein the code-rewriting kernel mod- 
ule is adapted to receive profile data regarding a 
computer program stored in shared user memory space 
while the computer program is executing, and to opti- 
mize at least a portion of that executing computer 
program; 

a dynamic optimization helper, wherein the dynamic 
optimization helper is located in shared user memory 
space and is adapted to analyze the computer program 
as it executes, collect profile data about the computer 
program, and provide selected portions of the collected 
profile data to the kernel module located in kernel 
memory space; and 

wherein the kernel module is adapted to generate at least 
one optimized translation of at least one portion of the 
executing computer program. 

11. The computer system of claim 10, wherein the com- 
puter program is a shared library. 

12. The computer system of claim 10, further comprising:

a code cache memory, operatively connected to the kernel
module and located in shared user space, wherein the
kernel module is adapted to write at least one optimized 
translation into the code cache memory and to insert at 
least one jump instruction into the executing computer 
program to direct flow of control to the at least one 
optimized translation. 

13. The computer system of claim 12, wherein the at least 
one optimized translation in the code cache is accessible by 
any computer process having access to the shared user 
memory space. 

14. The computer system of claim 10, wherein the kernel 
module is adapted to: 

examine the profile data to select at least a portion of the 
executing computer program for optimization; 
copy the at least one portion of the computer program into
kernel memory space;

decode the at least one portion of the computer program
from an object code to an intermediate code;

optimize the at least one portion of the computer program
based on the profile data; and

encode the optimized at least one portion of the computer
program from intermediate code to object code.

15. The computer system of claim 10, wherein the 
dynamic optimization helper is further adapted to reduce the
profile data to one or more traces and provide the one or 
more traces to the kernel module. 

16. A computer system for dynamically optimizing
computer programs using a code-rewriting module, comprising:

at least one processor, adapted to execute computer 
programs, including computer programs that are part of 
a computer operating system kernel; 

a memory, operatively connected to the processor,
adapted to store in kernel memory space a computer
operating system program kernel and a code-rewriting
kernel module, wherein the code-rewriting kernel mod- 
ule is adapted to receive profile data regarding a 
computer program stored in shared user memory space 
while the computer program is executing, and to opti- 
mize at least a portion of that executing computer 
program; and 

a dynamic optimization helper located in kernel memory 
space, wherein the dynamic optimization helper is 
adapted to analyze the computer operating system 
program kernel as it executes, collect profile data about 
the computer operating system program kernel, and 
provide the profile data into the kernel module.

17. The computer system of claim 16, wherein the
computer program is the computer operating system program
kernel.

* * * * *